Natural Language Interaction with Large Databases

ABSTRACT

A method includes applying at least one tag to at least one data element stored in a database, the tag having at least one associated rule, utilizing the at least one associated rule to generate at least one variant of the data element, and storing the at least one variant in the database.

TECHNICAL FIELD

This invention relates generally to a method and apparatus for generating text variants in databases.

BACKGROUND

It is known in the art to provide natural language access to large databases such as those comprised of telephone directories, stock libraries, book libraries, and the like. Requests for data from such databases are often written in natural text or spoken and converted into their textual content. Similarly, responses to requests are either provided in a textual format or converted to spoken language.

Ideally, every request would recite a portion of the desired data element to be accessed verbatim so as to aid in identifying precisely which data element is desired. Unfortunately, the format of the data stored in such databases, usually a text format, oftentimes differs significantly from the format in which such data is requested. For example, words or phrases contained in the text may be omitted or added. In addition, the order of words may be changed. Some words may be replaced with synonyms, while in other instances paraphrasing may be employed.

The result of such discrepancies is that it is often not possible to match requests for data with the data requested.

SUMMARY OF THE PREFERRED EMBODIMENTS

In an exemplary embodiment of the invention, a method includes applying at least one tag to at least one data element stored in a database, the tag having at least one associated rule, utilizing the at least one associated rule to generate at least one variant of the data element, and storing the at least one variant in the database.

In another exemplary embodiment of the invention, a system includes a database in which is stored at least one data element, means for applying at least one tag to the at least one data element, the tag having at least one associated rule, means for utilizing the at least one associated rule to generate at least one variant of the data element, and means for storing the at least one variant on the database.

In yet another exemplary embodiment of the invention, a signal bearing medium tangibly embodies a program of machine-readable instructions executable by a digital processing apparatus to perform operations to generate variants of data elements, the operations including applying at least one tag to at least one data element stored in a database, the tag having at least one associated rule, utilizing the at least one associated rule to generate at least one variant of the data element, and storing the at least one variant in the database.

BRIEF DESCRIPTION OF THE DRAWINGS

The foregoing and other aspects of these teachings are made more evident in the following Detailed Description, when read in conjunction with the attached Drawing Figures, wherein:

FIG. 1 is a flow chart of an exemplary method by which a data element is tagged.

FIG. 2 is an illustration of an exemplary embodiment of a parse tree for the data element of FIG. 1.

FIG. 3 is an illustration of an exemplary method by which a data element is tagged.

FIG. 4 is an illustration of an exemplary embodiment of a parse tree for the data element of FIG. 3.

FIG. 5 is an illustration of an exemplary method of the invention.

FIG. 6 is a diagram of an exemplary system for practicing the invention.

FIG. 7 is a flow chart of a further exemplary method of the invention.

DETAILED DESCRIPTION

An aspect of this invention addresses a need for a method of augmenting an existing database to contain alternate listings, or variants, of existing data elements to increase the likelihood that requests can be mapped to desired responses. In an exemplary embodiment of the invention, a two step technique is employed whereby data is manually tagged and a transformation procedure is subsequently applied to the data via the application of rules associated with the tags. Once the data is tagged, the transformation procedure generates a multitude of variants of the original data to which the tags have been applied. As is described more fully below, the method by which the manual tagging of the data is performed enforces an advantageous uniformity over the manner in which variants are generated. In addition, once the data is tagged, the rules associated with each tag may be altered and updated as required, allowing for the automated regeneration of variants.

In an exemplary embodiment of the invention, data is manually parsed through the application of tags to the data. As used herein, and not as a limitation, “data” refers to text strings. A text string is formed of a plurality of binary values, typically bytes, wherein each typically corresponds to a single character in a character set such as ASCII or EBCDIC. Such text strings typically describe entities such as, for example, “Chen, Stanley, MD”. Note that this example of data comprises a listing containing three pieces of discernible information. Specifically, the individual's first name is “Stanley”, the last name is “Chen”, and the individual's profession is that of an “MD”. Were this information to be stored in a relational database, a table formed of at least three fields representing the first name, last name, and profession would be utilized. One could then query the database using, for example, structured query language (SQL) to find and retrieve information contained in any of the fields.
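By way of non-limiting illustration, the relational storage described above can be sketched with Python's built-in sqlite3 module. The table name `listings` and the column names are assumptions made for illustration only and do not appear in the original teachings.

```python
import sqlite3

# In-memory database; the table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE listings (first_name TEXT, last_name TEXT, profession TEXT)"
)
conn.execute(
    "INSERT INTO listings VALUES (?, ?, ?)", ("Stanley", "Chen", "MD")
)

# Query on any one field to retrieve the whole record.
rows = conn.execute(
    "SELECT first_name, last_name, profession FROM listings WHERE last_name = ?",
    ("Chen",),
).fetchall()
```

Such a query succeeds only because the listing has already been split into fields, which is the property that free-form directory text generally lacks.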

Oftentimes data is not, in its original form, particularly well suited for storage in a relational database. An example of such data is the textual data that forms directory listings such as phone books and the like. There are few if any format requirements imposed upon such data and, as a result, the data is not easily broken up into individual column entries of a table such as are utilized in relational databases. In addition, the manner in which such data is routinely queried lacks formal structure. As a result, such queries oftentimes cannot be issued in languages such as SQL.

As noted above, requests to identify a particular item of data within a database containing text strings are often transcribed or converted from the spoken word. For example, a request to retrieve the above noted sample entry might request “Dr. Chen”, “Mr. Chen, MD”, “Dr. Stanley Chen” and various other variations, or variants.

Examples of data entries and a query that might be issued to retrieve each data entry are as follows:

-   Entry 1: “Bank A; Departments; Small Business; Card Merchant Services 2123847402 NEW YORK CITY”
-   Query 1: “Bank A merchant services”
-   Entry 2: “U.S. Government; Congress; Senators; Doe John; Washington D.C. Office 2022343445 SAINT PAUL”
-   Query 2: “office of senator John Doe”

Regardless of the format of the request, in the earlier “Chen” example, it is nevertheless required that the request be effectively mapped to the entry “Chen, Stanley, MD”.

In an exemplary embodiment of the invention, each data entry element is tagged and the tags are used to generate a multitude of variants. As described more fully below, there is at least one rule associated with each tag. Once a text string is tagged, the rules associated with the tags are applied in an automated fashion to generate multiple variants of the original text string.

With reference to FIG. 1, there is illustrated an exemplary application of tags 13 to a data element 11. In the example, data element 11 is a text string formed of the text “Valley Brook City of”. At step 1, a first level parse is manually performed. As used herein, a “parse” is any application of tags 13 to either a data element 11 or another tag 13 that serves to define an attribute of the tag 13 or data element 11 to which the tag 13 is applied. Furthermore, a “first level parse” refers to an initial parsing of a data element 11, while a “second level parse” refers to a parse performed upon the output of a first level parse, and so on. As is therefore evident, more than one parse can be applied to a data element 11 and, in practice, it is likely that different data elements 11 will be subjected to differing numbers of parses.

Returning to the example, a first level parse of the data element results in two tags 13, <locality> and <deptof>, being applied to the data element 11. Specifically, the <locality> tag 13 is assigned the value of “Valley Brook” and the <deptof> tag 13 is assigned the value of “City of”. Applying a second level parse at step 2, the <X+deptof> tag is assigned two children tags 13, specifically, <locality> and <deptof>. By way of explanatory convention, the results of the parse may be written as:

Valley Brook City of      data element
<-locality-> <-deptof->   level 1 parse
<------X+deptof------->   level 2 parse

With reference to FIG. 2, there is illustrated the hierarchical relationship of the exemplary tags 13 applied to the data element 11 as described above so as to form a parse tree 15. As is evident from its description, the tag <X+deptof> defines the combination of the tag <deptof> with at least one other, unspecified, tag 13. In the example shown, the additional tag is <locality>. In the parse tree 15, tags <locality> and <deptof> form the children of tag <X+deptof>. Once the data element is tagged, the rules associated with each tag 13 may be applied in a top down fashion starting at the topmost tag in the parse tree 15 and proceeding until all possible variants have been generated.
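By way of non-limiting illustration, the parse tree described above can be represented as a small recursive data structure. The class name `TagNode` and the helper functions `depth` and `surface_text` are hypothetical names introduced here for illustration; the original teachings do not prescribe any particular representation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TagNode:
    """One tag in a parse tree; leaf tags carry text, parent tags carry children."""
    name: str
    text: Optional[str] = None
    children: List["TagNode"] = field(default_factory=list)


# Level 1 parse: two leaf tags applied to the text of the data element.
locality = TagNode("locality", text="Valley Brook")
deptof = TagNode("deptof", text="City of")

# Level 2 parse: <X+deptof> is applied over the two level 1 tags.
root = TagNode("X+deptof", children=[locality, deptof])


def depth(node: TagNode) -> int:
    """Number of tag levels at or below this node."""
    return 1 + max((depth(c) for c in node.children), default=0)


def surface_text(node: TagNode) -> str:
    """Recover the original text by reading the leaves left to right."""
    if not node.children:
        return node.text or ""
    return " ".join(surface_text(c) for c in node.children)
```

Keeping the text only on the leaves means the original data element can always be recovered from the tree, while rules attached to parent tags remain free to reorder or drop children.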

As described herein, the rules associated with each tag 13 are described in terms of their functioning without reference to the manner in which such functionality is implemented. It is understood that any number of suitable methodologies involving the execution of computer code can be implemented both to encode the logic associated with each rule and to implement the logic so encoded. As illustrated, an exemplary rule associated with the tags 13 of FIGS. 1 and 2 might appear in pseudo-code as:

<X+deptof> = <X>_<deptof> and <deptof>_<X>

Such pseudo-code is read to define the resolving of <X+deptof> to two variants. Specifically, the first variant is formed of the data element associated with the child of <X+deptof> that is not <deptof> followed by a space, “ ”, followed by the data element associated with <deptof>. The second variant is formed of the data element associated with <deptof> followed by a space, “ ”, followed by the data element associated with the child of <X+deptof> that is not <deptof>. Applying this rule yields the variants “Valley Brook City of” and “City of Valley Brook”. Note that, as defined, the exemplary variant generation required only a single level of resolving to generate the variants. This results from the fact that the parse tree 15 defines parent node <X+deptof> as having only one layer of children nodes, or tags 13, beneath it. Therefore, proceeding from the top of parse tree 15 at tag <X+deptof> to the bottommost children of the parse tree requires only one iteration of resolving.
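By way of non-limiting illustration, the rule for <X+deptof> described above can be sketched as a single function. The function name `resolve_x_deptof` and the dictionary representation of the children tags are assumptions made for illustration only.

```python
from typing import Dict, List


def resolve_x_deptof(children: Dict[str, str]) -> List[str]:
    """Apply the rule: <X> then <deptof>, and <deptof> then <X>.

    `children` maps child tag names to their text; one child must be
    <deptof> and the single other child plays the role of <X>.
    """
    deptof = children["deptof"]
    others = [text for name, text in children.items() if name != "deptof"]
    if len(others) != 1:
        raise ValueError("expected exactly one non-<deptof> child")
    x = others[0]
    return [f"{x} {deptof}", f"{deptof} {x}"]


variants = resolve_x_deptof({"locality": "Valley Brook", "deptof": "City of"})
# variants == ["Valley Brook City of", "City of Valley Brook"]
```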

Even in the instance of a single level process of resolving the topmost tag 13 into all possible defined variants, it is sometimes necessary to apply more than one rule as described above. For example, in addition to the rule defined above associated with the tag <X+deptof>, the tag <deptof> may likewise have associated with it the following rule:

<deptof>=dept_of and dept

This pseudo-code is read to define the tag <deptof> as resolving to two variants, specifically the component of the associated data element that is not the word “of”, followed by a space, followed by “of”, as well as the component of the associated data element that is not the word “of” in isolation. When such a rule is applied in concert with the rule defined above, the result is four variants: “Valley Brook City of”, “Valley Brook City”, “City of Valley Brook”, and “City Valley Brook”. Note that, in the example, the application of the rules to the tags 13 results in a recitation of the original data element, “Valley Brook City of”, two variants which are likely equivalent to the manner in which a speaker might request information, “Valley Brook City” and “City of Valley Brook”, and one unlikely construction, “City Valley Brook”. A similar distribution of resulting variants is possible, but not required, for any particular data element.
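By way of non-limiting illustration, the two rules described above can be composed so that the <deptof> rule is applied within each ordering produced by the <X+deptof> rule. The function names and the treatment of a trailing “of” are assumptions made for illustration only.

```python
from typing import List


def resolve_deptof(text: str) -> List[str]:
    """Rule for <deptof>: output the text with and without a trailing "of"."""
    words = text.split()
    if words and words[-1].lower() == "of":
        stem = " ".join(words[:-1])
        return [f"{stem} of", stem]
    return [text]


def resolve_x_deptof(x_text: str, deptof_text: str) -> List[str]:
    """Rule for <X+deptof>: both orderings, each with every <deptof> variant."""
    results = []
    for d in resolve_deptof(deptof_text):
        results.append(f"{x_text} {d}")  # <X> followed by <deptof>
    for d in resolve_deptof(deptof_text):
        results.append(f"{d} {x_text}")  # <deptof> followed by <X>
    return results


variants = resolve_x_deptof("Valley Brook", "City of")
```

For the example data element this yields the four variants listed above, including the unlikely construction “City Valley Brook”, which an operator may later choose to remove.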

With reference to FIG. 3, there are illustrated the exemplary steps which may be taken to tag a data element 11 such that the associated parse tree has more than two levels. As before, the data element 11 is ultimately tagged with the tag <X+deptof>. In this instance, <deptof> is paired with the tag <descriptor>. <descriptor> is further broken down into tags <description> and <subdescription>. <description> is associated with the text “Defense” and <subdescription> is associated with the text “Strategic Planning”. In this example, the rule associated with the tag <descriptor> may take the form:

<descriptor> = <description> and <description>,_<subdescription>
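By way of non-limiting illustration, the <descriptor> rule can be sketched as follows. The function name `resolve_descriptor` is a hypothetical name, and only the <descriptor> subtree is shown because the text associated with <deptof> in this example is not recited above.

```python
from typing import List


def resolve_descriptor(description: str, subdescription: str) -> List[str]:
    """Rule for <descriptor>: the description alone, and the description
    followed by a comma, a space, and the subdescription."""
    return [description, f"{description}, {subdescription}"]


variants = resolve_descriptor("Defense", "Strategic Planning")
# variants == ["Defense", "Defense, Strategic Planning"]
```

In a parse tree having more than two levels, the output of such a rule would in turn serve as the <X> text consumed by the rule of the parent <X+deptof> tag.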

While the exemplary rules illustrated above involve generating permutations of the text forming the data elements 11 with which each rule is associated, the rules of the invention are not so limited. Rather the invention is broadly drawn to encompass any and all forms of rules that encode instructions for the manipulation of data elements. For example, instead of manipulating only the text of a data element 11 associated with a rule, a rule may operate to substitute other text not part of the data element 11 when generating variants. An example of such a parse is as follows:

Andrews Thomas Smith and Acme Attorneys      data element
<----------anyorder---------> <-biztype->    level 1 parse

In this example, the rule associated with tag <anyorder> generates all subsets of names in any order forming the text “Andrews Thomas Smith and Acme”. The rule associated with tag <biztype> functions, in part, to generate synonyms for some or all of the business identifiers in the text of the data element 11 associated with the tag <biztype>. For example, in addition to generating “Attorneys”, the rule associated with the tag <biztype> might also generate “Attorneys at law”, “Lawyers”, “Law Firm”, and the like. In such an instance, the rule or rules associated with the tag <biztype> are therefore specific to the text of a data element 11. Such specificity allows the data element “Tommy's Automobile Repair” tagged with the tag <biztype> to generate the variant “Tommy's Car Repair” while preventing the data element “AAA” tagged with the tag <name> from generating the variant “American Car Association”.
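By way of non-limiting illustration, rules of the <anyorder> and <biztype> kind can be sketched as follows. The function names, the space-joined output, and the contents of the synonym table are assumptions made for illustration only; three names are used to keep the output small.

```python
from itertools import permutations
from typing import Dict, List


def resolve_anyorder(names: List[str]) -> List[str]:
    """Every non-empty subset of `names`, in every order, joined by spaces."""
    out = []
    for k in range(1, len(names) + 1):
        for ordering in permutations(names, k):
            out.append(" ".join(ordering))
    return out


# Hypothetical synonym table keyed on the tagged text itself, so that the
# rule is specific to the text of the data element.
BIZTYPE_SYNONYMS: Dict[str, List[str]] = {
    "Attorneys": ["Attorneys", "Attorneys at law", "Lawyers", "Law Firm"],
}


def resolve_biztype(text: str) -> List[str]:
    """Return the registered synonyms for the text, or the text unchanged."""
    return BIZTYPE_SYNONYMS.get(text, [text])


name_variants = resolve_anyorder(["Andrews", "Thomas", "Smith"])
biz_variants = resolve_biztype("Attorneys")
```

Because the synonym lookup is keyed on the tagged text, text with no registered entry is output unchanged rather than expanded.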

Note that the tag 13 names can denote a semantic content (<locality>, <biztype>) or a functional description of the rule associated with the tag 13 (<any order>). Examples of other exemplary tags 11 and the function of their exemplary associated rules 13 are illustrated with reference to Table 1.

TABLE 1

Tag                  Rule/Function
<required>           Will always be outputted
<optional>           May be skipped
<any order>          Words of text may be outputted in any order
<bag of words>       Some subset of words may be outputted
<name>               Name of business/brand name
<biz type>           Description of a business
<location>           City, state, street, etc.
<sub description>    Sub-description, department name
<deptof>             E.g. Dept of, Office of, City of, etc.
<comment>            E.g. (Fax Line), (24 hours)
<verbatim>           Output exactly as written

With reference to FIG. 5, there is illustrated a block diagram of an exemplary method of the invention. At step 1, the tags and their associated rules are defined. As noted above, there is no limit placed on the number or form of tags or on the rules that accompany them. New tags may be created as needed. In addition, existing rules may be changed and new rules may be created at any time.

At step 2, tags 13 are applied to one or more data elements 11 stored in a database. Tagging may typically be performed by one or more sentient beings, such as a human operator. Tagging may be accomplished through an interface, such as a graphical user interface (GUI). The GUI displays each data element and permits the operator to apply tags to the text forming each data element. By defining a finite number of tags in step 1 to be applied to the data elements in step 2, a desirable level of uniformity is achieved when more than one operator works on the same one or more data elements 11 stored in a database. In other embodiments the tagging operation may be performed by software in an automated fashion, with or without human assistance.

At step 3, variants for each data element are generated by a process of applying the rules 13 associated with the tags 11 as described above. The generated variants are stored in the database as data elements 65.

With reference to FIG. 6, there is illustrated an exemplary embodiment of a system for practicing the invention. A database 67 stores the data elements 65, tags 11, and rules 13. Database 67 may be any device capable of storing and retrieving digital data. Database 67 is coupled to a processor 71. Processor 71 operates to control the operation of database 67 using either hardware encoded machine instructions or software encoded machine instructions. Processor 71 is utilized to perform the generation of variants from the data elements 65, tags 11, and rules 13 stored on database 67, to store the variants upon database 67, and to instruct the inputting of data from and outputting of data to interface 69. An interface 69 is coupled to database 67. Interface 69 may be utilized both to input data, such as data elements 65, tags 11, and rules 13, into database 67 and to accept output from database 67. Once generated, the data elements and variants 65 are stored in database 67 as individually accessible data structures, preferably text strings, for access and manipulation by processor 71.
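By way of non-limiting illustration, the storing of variants as individually accessible text strings can be sketched with Python's built-in sqlite3 module. The table name `variants`, its columns, and the function name `store_variants` are assumptions made for illustration only; any device capable of storing and retrieving digital data could be substituted.

```python
import sqlite3
from typing import List


def store_variants(conn: sqlite3.Connection, element: str, variants: List[str]) -> None:
    """Store each variant as its own row, linked to the original data element."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS variants ("
        "element TEXT NOT NULL, variant TEXT NOT NULL, "
        "UNIQUE (element, variant))"
    )
    # INSERT OR IGNORE makes regeneration idempotent for unchanged variants.
    conn.executemany(
        "INSERT OR IGNORE INTO variants (element, variant) VALUES (?, ?)",
        [(element, v) for v in variants],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
store_variants(
    conn,
    "Valley Brook City of",
    ["Valley Brook City of", "Valley Brook City", "City of Valley Brook"],
)
stored = [
    row[0]
    for row in conn.execute(
        "SELECT variant FROM variants WHERE element = ? ORDER BY rowid",
        ("Valley Brook City of",),
    )
]
```

Keeping one row per variant allows an operator to delete or edit a single variant without disturbing the others.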

Once the variants are generated at step 3, an operator can view the variants on interface 69 and edit the database 67 at step 4 as desired. For example, an operator may wish to delete one or more variants from the database 67. This situation typically results when the rules employed to generate variants operate to produce one or more variants which are not sufficiently syntactically correct to merit retaining. In addition, an operator may decide to change the manner in which tags 11 were assigned to a data element 65 after viewing the variants that such tagging produced.

The invention's ability to generate variants from separately defined tags 11 and rules 13 provides a beneficial degree of control and flexibility. For example, after changing the definition of a single rule 13, one can proceed to regenerate all of the variants for an entire database in an automated fashion.
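By way of non-limiting illustration, the regeneration of variants following a change to a single rule can be sketched as follows. The rule table, the list of tagged elements, and the function name `regenerate` are assumptions made for illustration only.

```python
from typing import Callable, Dict, List, Tuple

Rule = Callable[[str], List[str]]

# Hypothetical rule table: tag name -> rule function.
rules: Dict[str, Rule] = {
    "deptof": lambda text: [text],  # initial rule: output the text verbatim
}

# Tagged elements persist independently of the rules: (tag name, text) pairs.
tagged: List[Tuple[str, str]] = [("deptof", "City of"), ("deptof", "Office of")]


def regenerate(
    tagged_elements: List[Tuple[str, str]], rule_table: Dict[str, Rule]
) -> Dict[str, List[str]]:
    """Re-run every tagged element through the current rule for its tag."""
    return {text: rule_table[tag](text) for tag, text in tagged_elements}


before = regenerate(tagged, rules)

# Change one rule definition, then regenerate everything without retagging.
rules["deptof"] = lambda text: [text, text.rsplit(" of", 1)[0]]
after = regenerate(tagged, rules)
```

Because the tags are stored separately from the rules, the manual tagging effort is preserved when a rule is redefined.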

In an alternative exemplary embodiment of the invention, the step of applying tags 11 to data elements 65 may be partially or wholly automated. After a portion of the data elements 65 in database 67 have been tagged, any manner of statistical analysis or parsing may be applied to discern, and output an indication of, the propriety of mapping specific tags to particular text strings or text string structures. Once so mapped, the output of the statistical parsing may be applied to data elements 65 which have not been previously manually tagged so as to tag them in an automated manner. In addition to a purely statistical analysis of data elements 65, such analysis may make use of a knowledge of the language in which the data element is written such as that which can be extracted from resources such as Wordnet™ or other sources of lexical and semantic information.
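By way of non-limiting illustration, a very simple form of statistical analysis over manually tagged text can be sketched as follows. The word-vote approach shown is a deliberately simplified stand-in chosen for illustration, not the statistical parsing contemplated above, and the function names `train` and `predict` are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Tuple


def train(tagged: List[Tuple[str, str]]) -> Dict[str, Counter]:
    """Count how often each lowercase word appears under each tag."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for text, tag in tagged:
        for word in text.lower().split():
            counts[word][tag] += 1
    return counts


def predict(counts: Dict[str, Counter], text: str) -> Optional[str]:
    """Pick the tag with the most word-level votes, or None if no word is known."""
    votes: Counter = Counter()
    for word in text.lower().split():
        votes.update(counts.get(word, Counter()))
    return votes.most_common(1)[0][0] if votes else None


# Manually tagged examples drawn from the text strings discussed above.
manual = [
    ("City of", "deptof"),
    ("Office of", "deptof"),
    ("Valley Brook", "locality"),
]
model = train(manual)
guess = predict(model, "Dept of")
```

A text string containing no previously seen word receives no tag and would be left for manual tagging.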

As noted above, post generation in step 3, the data elements and variants 65, as well as the tags 11 and rules 13, may be edited by a user, such as via interface 69. Such editing may be performed to remove unwanted variants 65, or to alter or otherwise modify existing tags 11 and rules 13.

With reference to FIG. 6, there is illustrated an alternative exemplary embodiment of the invention wherein the database 67, containing the data elements 65 and the generated variant data elements 65, is used to respond to requests for data such as requests for information found in phone or other directory listings. As illustrated at step 4, a request, typically submitted in a textual format, is matched to a data element or variant data element 65. In an exemplary embodiment, a statistical matching is performed to determine which data element or variant 65 most closely matches the request. In such instances, there is oftentimes created a database 67 of each request and the data element or variant 65 which was determined to be responsive to the request. In such instances, statistical modeling may be applied to such a database to derive tags 11 and rules 13 in an automated fashion.
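By way of non-limiting illustration, the matching of a request to the closest stored data element or variant can be sketched using a token overlap score. The Jaccard measure shown is a simple stand-in chosen for illustration, not the statistical matching contemplated above, and the function names are hypothetical.

```python
from typing import Iterable, Tuple


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two strings, in the range [0, 1]."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def best_match(request: str, candidates: Iterable[str]) -> Tuple[str, float]:
    """Return the stored element or variant that scores highest for the request."""
    return max(((c, jaccard(request, c)) for c in candidates), key=lambda p: p[1])


stored = ["Valley Brook City of", "Valley Brook City", "City of Valley Brook"]
match, score = best_match("valley brook city", stored)
```

This measure ignores word order, so two variants containing the same words score identically; an order-sensitive measure would be needed to distinguish them.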

Such statistical modeling and statistical parsing is described more fully with reference to (1) F. Och, “Statistical Machine Translation: From Single Word Models to Alignment Templates,” Ph.D. thesis, RWTH Aachen, Germany, 2002, (2) Eugene Charniak, “Statistical Parsing with a Context-Free Grammar and Word Statistics”, Proc. AAAI, pp. 598-603, 1997, and (3) Michael Collins, “A New Statistical Parser Based on Bigram Lexical Dependencies,” Proceedings of the Thirty-Fourth Annual Meeting of the Association for Computational Linguistics, pp. 184-191, 1996.

As noted, an exemplary use of the method and resulting database 67 of the invention is for use in responding to queries for directory listed data. By generating many variants, the method of the invention increases the likelihood that a request for data will match, or nearly match, one of the generated variants stored on the database 67. As queries are matched to data elements 65 and their variants, it is possible to keep track of which data elements 65 and their variants are more or less likely to be requested in relation to other variants of the same data element 65. Such information is useful when responding to requests as it is indicative of the most probable manner in which a requester would prefer to receive results. In addition, such information allows one, operating in accordance with the invention, to generate questions for the provision by a user of additional information when attempting to match a query to a data element or variant 65.
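By way of non-limiting illustration, the tracking of which variants of a data element are most often requested can be sketched as follows. The class name `VariantStats` and its methods are hypothetical names introduced for illustration only.

```python
from collections import Counter, defaultdict
from typing import Dict, Optional


class VariantStats:
    """Tally how often each variant of a data element satisfies a request."""

    def __init__(self) -> None:
        self._hits: Dict[str, Counter] = defaultdict(Counter)

    def record(self, element: str, variant: str) -> None:
        """Note that a request was matched to `variant` of `element`."""
        self._hits[element][variant] += 1

    def most_requested(self, element: str) -> Optional[str]:
        """Return the variant matched most often, or None if never matched."""
        hits = self._hits.get(element)
        return hits.most_common(1)[0][0] if hits else None


stats = VariantStats()
stats.record("Valley Brook City of", "City of Valley Brook")
stats.record("Valley Brook City of", "City of Valley Brook")
stats.record("Valley Brook City of", "Valley Brook City")
preferred = stats.most_requested("Valley Brook City of")
```

The most requested variant may then be used, for example, as the form in which results are presented to a requester.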

Although described in the context of particular embodiments, it will be apparent to those skilled in the art that a number of modifications and various changes to these teachings may occur. Thus, while the invention has been particularly shown and described with respect to one or more exemplary embodiments thereof, it will be understood by those skilled in the art that certain modifications or changes may be made therein without departing from the scope and spirit of the invention as set forth above, or from the scope of the ensuing claims.

1. A method comprising: applying at least one tag to at least one data element stored in a database, said tag having at least one associated rule; utilizing said at least one associated rule to generate at least one variant of said data element; and storing said at least one variant in said database.
2. The method of claim 1 wherein said at least one tag and said at least one associated rule are generated in an automated fashion.
3. The method of claim 1 comprising utilizing statistical parsing to apply at least one of said tags to at least one of said data elements.
4. The method of claim 1 comprising altering said at least one rule in response to said generated at least one variant.
5. The method of claim 1 wherein said at least one data element comprises a text string.
6. The method of claim 1 comprising editing at least one of said variants.
7. The method of claim 1 comprising: receiving a request for at least one of said at least one data element and said at least one variant; comparing said request to said at least one data element and said at least one variant; and selecting at least one of said at least one data element and said at least one variant corresponding to said request.
8. The method of claim 7 wherein said request comprises text.
9. The method of claim 8 wherein said text comprises natural language.
10. A system comprising: a database in which is stored at least one data element; means for applying at least one tag to said at least one data element said tag having at least one associated rule; means for utilizing said at least one associated rule to generate at least one variant of said data element; and means for storing said at least one variant on said database.
11. The system of claim 10 wherein said means for applying comprises a user interface.
12. The system of claim 10 wherein said at least one tag is applied to said at least one data element manually.
13. The system of claim 10 comprising: means for receiving a request for at least one of said at least one data element and said at least one variant; means for comparing said request to said at least one data element and said at least one variant of said data element; and selecting at least one of said at least one data element and said at least one variant corresponding to said request.
14. A signal bearing medium tangibly embodying a program of machine-readable instructions executable by a digital processing apparatus to perform operations to generate variants of data elements, the operations comprising: manually applying at least one tag to at least one data element stored in a database said tag having at least one associated rule; utilizing said at least one associated rule to generate at least one variant of said data element; and storing said at least one variant in said database.
15. The signal bearing medium of claim 14 comprising defining said at least one tag and said at least one associated rule.
16. The signal bearing medium of claim 15 wherein said at least one tag and said at least one associated rule are generated in an automated fashion.
17. The signal bearing medium of claim 14 comprising utilizing statistical parsing to apply at least one of said tags to at least one of said data elements.
18. The signal bearing medium of claim 14 wherein said at least one data element is a text string.
19. The signal bearing medium of claim 14 comprising: receiving a request for at least one of said at least one data element and said at least one variant; comparing said request to said at least one data element and said at least one variant; and selecting at least one of said at least one data element and said at least one variant corresponding to said request.
20. A data structure for storage in a memory for use by a text selection function, said data structure comprising at least one data element and at least one variant of said data element wherein said at least one variant is generated from said at least one data element via the application of at least one tag having at least one associated rule to said at least one data element.